
Explore and Summarize white wine product by Rahul Kumar

The data set consists of 4,898 white wines with 11 input variables.

Data fields

Input variables (based on physicochemical tests):

1 - fixed acidity
2 - volatile acidity
3 - citric acid
4 - residual sugar
5 - chlorides
6 - free sulfur dioxide
7 - total sulfur dioxide
8 - density
9 - pH
10 - sulphates
11 - alcohol

Output variable (based on sensory data): 12 - quality (score between 0 and 10)

Other:

13 - id (unique ID for each sample, needed for submission)

#library(ggplot2)
#install.packages('knitr',dependencies = T)
#install.packages("lmtest", repos = "http://cran.us.r-project.org")
library(lmtest)
## Loading required package: zoo
## 
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
library(knitr)
library(ggplot2)
library(GGally)
library(scales)
library(memisc)
## Loading required package: lattice
## Loading required package: MASS
## 
## Attaching package: 'memisc'
## The following object is masked from 'package:scales':
## 
##     percent
## The following objects are masked from 'package:stats':
## 
##     contr.sum, contr.treatment, contrasts
## The following object is masked from 'package:base':
## 
##     as.array
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:memisc':
## 
##     collect, recode, rename
## The following object is masked from 'package:MASS':
## 
##     select
## The following object is masked from 'package:GGally':
## 
##     nasa
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
list.files()
## [1] "explore_and_summarize_data_files"  
## [2] "explore_and_summarize_data.html"   
## [3] "explore_and_summarize_data.nb.html"
## [4] "explore_and_summarize_data.Rmd"    
## [5] "wineQualityWhites.csv"
pf=read.csv('wineQualityWhites.csv')
names(pf)
##  [1] "X"                    "fixed.acidity"        "volatile.acidity"    
##  [4] "citric.acid"          "residual.sugar"       "chlorides"           
##  [7] "free.sulfur.dioxide"  "total.sulfur.dioxide" "density"             
## [10] "pH"                   "sulphates"            "alcohol"             
## [13] "quality"
head(pf)
##   X fixed.acidity volatile.acidity citric.acid residual.sugar chlorides
## 1 1           7.0             0.27        0.36           20.7     0.045
## 2 2           6.3             0.30        0.34            1.6     0.049
## 3 3           8.1             0.28        0.40            6.9     0.050
## 4 4           7.2             0.23        0.32            8.5     0.058
## 5 5           7.2             0.23        0.32            8.5     0.058
## 6 6           8.1             0.28        0.40            6.9     0.050
##   free.sulfur.dioxide total.sulfur.dioxide density   pH sulphates alcohol
## 1                  45                  170  1.0010 3.00      0.45     8.8
## 2                  14                  132  0.9940 3.30      0.49     9.5
## 3                  30                   97  0.9951 3.26      0.44    10.1
## 4                  47                  186  0.9956 3.19      0.40     9.9
## 5                  47                  186  0.9956 3.19      0.40     9.9
## 6                  30                   97  0.9951 3.26      0.44    10.1
##   quality
## 1       6
## 2       6
## 3       6
## 4       6
## 5       6
## 6       6

Summary

summary(pf)
##        X        fixed.acidity    volatile.acidity  citric.acid    
##  Min.   :   1   Min.   : 3.800   Min.   :0.0800   Min.   :0.0000  
##  1st Qu.:1225   1st Qu.: 6.300   1st Qu.:0.2100   1st Qu.:0.2700  
##  Median :2450   Median : 6.800   Median :0.2600   Median :0.3200  
##  Mean   :2450   Mean   : 6.855   Mean   :0.2782   Mean   :0.3342  
##  3rd Qu.:3674   3rd Qu.: 7.300   3rd Qu.:0.3200   3rd Qu.:0.3900  
##  Max.   :4898   Max.   :14.200   Max.   :1.1000   Max.   :1.6600  
##  residual.sugar     chlorides       free.sulfur.dioxide
##  Min.   : 0.600   Min.   :0.00900   Min.   :  2.00     
##  1st Qu.: 1.700   1st Qu.:0.03600   1st Qu.: 23.00     
##  Median : 5.200   Median :0.04300   Median : 34.00     
##  Mean   : 6.391   Mean   :0.04577   Mean   : 35.31     
##  3rd Qu.: 9.900   3rd Qu.:0.05000   3rd Qu.: 46.00     
##  Max.   :65.800   Max.   :0.34600   Max.   :289.00     
##  total.sulfur.dioxide    density             pH          sulphates     
##  Min.   :  9.0        Min.   :0.9871   Min.   :2.720   Min.   :0.2200  
##  1st Qu.:108.0        1st Qu.:0.9917   1st Qu.:3.090   1st Qu.:0.4100  
##  Median :134.0        Median :0.9937   Median :3.180   Median :0.4700  
##  Mean   :138.4        Mean   :0.9940   Mean   :3.188   Mean   :0.4898  
##  3rd Qu.:167.0        3rd Qu.:0.9961   3rd Qu.:3.280   3rd Qu.:0.5500  
##  Max.   :440.0        Max.   :1.0390   Max.   :3.820   Max.   :1.0800  
##     alcohol         quality     
##  Min.   : 8.00   Min.   :3.000  
##  1st Qu.: 9.50   1st Qu.:5.000  
##  Median :10.40   Median :6.000  
##  Mean   :10.51   Mean   :5.878  
##  3rd Qu.:11.40   3rd Qu.:6.000  
##  Max.   :14.20   Max.   :9.000

PLOT EACH VARIABLE

names(pf)
##  [1] "X"                    "fixed.acidity"        "volatile.acidity"    
##  [4] "citric.acid"          "residual.sugar"       "chlorides"           
##  [7] "free.sulfur.dioxide"  "total.sulfur.dioxide" "density"             
## [10] "pH"                   "sulphates"            "alcohol"             
## [13] "quality"
ggplot(aes(x=fixed.acidity,y=quality),data=pf)+
      geom_histogram(stat="identity",binwidth = 1) 
## Warning: Ignoring unknown parameters: binwidth, bins, pad

summary(pf$fixed.acidity)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.800   6.300   6.800   6.855   7.300  14.200

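Fixed acidity ranges from 3.8 to 14.2. In the plot, the summed quality rises up to about the median fixed acidity (6.8) and falls as fixed acidity increases beyond it. Because each bar stacks the quality scores of all wines at that value, its height also reflects how many wines have that fixed acidity.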
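With `stat="identity"`, each bar stacks the `quality` values of every wine that shares an x value, so bar height grows with the number of wines as well as with their scores. One way to separate the two effects is to compare the summed and the mean quality per bin. The sketch below does this in base R on a small made-up data frame; the `toy` data and the bin edges are illustrative assumptions, not values from the wine file.

```r
# Toy stand-in for the wine data: values are made up for illustration only.
toy <- data.frame(
  fixed.acidity = c(5.1, 6.2, 6.4, 6.8, 6.9, 7.1, 7.3, 9.5),
  quality       = c(6,   5,   6,   6,   7,   5,   6,   6)
)

# Bin the x variable; include.lowest keeps a value equal to the first break.
toy$bin <- cut(toy$fixed.acidity, breaks = c(5, 6, 7, 8, 10),
               include.lowest = TRUE)

# Summed quality per bin: what a stacked identity bar displays.
sum_quality  <- as.vector(tapply(toy$quality, toy$bin, sum))
# Mean quality per bin: the typical score of wines in that bin.
mean_quality <- as.vector(tapply(toy$quality, toy$bin, mean))
# Number of wines per bin.
n_wines      <- as.vector(tapply(toy$quality, toy$bin, length))

print(data.frame(bin = levels(toy$bin), n_wines, sum_quality, mean_quality))
```

In this toy example the second bin has by far the largest sum only because it holds the most wines, while the mean quality is nearly flat across bins.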

ggplot(aes(x=volatile.acidity,y=quality),data=pf)+
      geom_histogram(stat="identity",binwidth = 1) 
## Warning: Ignoring unknown parameters: binwidth, bins, pad

summary(pf$volatile.acidity)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0800  0.2100  0.2600  0.2782  0.3200  1.1000

Volatile acidity ranges from 0.08 to 1.10. In the plot, the summed quality rises up to about the median value of 0.26 and then starts falling.

ggplot(aes(x=citric.acid,y=quality),data=pf)+
      geom_histogram(stat="identity",binwidth = 1) 
## Warning: Ignoring unknown parameters: binwidth, bins, pad

summary(pf$citric.acid)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.2700  0.3200  0.3342  0.3900  1.6600

Citric acid ranges from 0.00 to 1.66. In the plot, the summed quality rises up to about the median value of 0.32 and falls as citric acid increases beyond it.

ggplot(aes(x=residual.sugar,y=quality),data=pf)+
      geom_histogram(stat="identity",binwidth = 1) 
## Warning: Ignoring unknown parameters: binwidth, bins, pad

summary(pf$residual.sugar)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.600   1.700   5.200   6.391   9.900  65.800

Residual sugar ranges from 0.6 to 65.8. In the plot, the summed quality appears highest at a residual sugar of about 5.2 (the median) and falls as residual sugar increases further.

ggplot(aes(x=chlorides,y=quality),data=pf)+
      geom_histogram(stat="identity",binwidth = 1) 
## Warning: Ignoring unknown parameters: binwidth, bins, pad

summary(pf$chlorides)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00900 0.03600 0.04300 0.04577 0.05000 0.34600

Chlorides range from 0.009 to 0.346. In the plot, the summed quality rises as chlorides increase up to the median value of 0.043 and falls as chlorides increase beyond it.

ggplot(aes(x=free.sulfur.dioxide,y=quality),data=pf)+
      geom_histogram(stat="identity",binwidth = 1) 
## Warning: Ignoring unknown parameters: binwidth, bins, pad

summary(pf$free.sulfur.dioxide)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    2.00   23.00   34.00   35.31   46.00  289.00

Free sulfur dioxide ranges from 2 to 289. In the plot, the summed quality rises as free sulfur dioxide increases up to the median value of 34 and falls as free sulfur dioxide increases beyond it.

ggplot(aes(x=total.sulfur.dioxide,y=quality),data=pf)+
      geom_histogram(stat="identity",binwidth = 1) 
## Warning: Ignoring unknown parameters: binwidth, bins, pad

summary(pf$total.sulfur.dioxide)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     9.0   108.0   134.0   138.4   167.0   440.0

Total sulfur dioxide ranges from 9 to 440. In the plot, the summed quality rises as total sulfur dioxide increases up to the median value of 134 and falls as total sulfur dioxide increases beyond it.

ggplot(aes(x=density,y=quality),data=pf)+
      geom_histogram(stat="identity",binwidth = 1) 
## Warning: Ignoring unknown parameters: binwidth, bins, pad

summary(pf$density)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9871  0.9917  0.9937  0.9940  0.9961  1.0390

Density ranges from 0.9871 to 1.039. In the plot, the summed quality rises as density increases up to the median value of 0.9937 and falls as density increases beyond it.

ggplot(aes(x=pH,y=quality),data=pf)+
      geom_histogram(stat="identity",binwidth = 1) 
## Warning: Ignoring unknown parameters: binwidth, bins, pad

summary(pf$pH)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.720   3.090   3.180   3.188   3.280   3.820

pH ranges from 2.72 to 3.82. In the plot, the summed quality rises as pH increases up to the median value of 3.18 and falls as pH increases beyond it.

ggplot(aes(x=sulphates,y=quality),data=pf)+
      geom_histogram(stat="identity",binwidth = 1) 
## Warning: Ignoring unknown parameters: binwidth, bins, pad

summary(pf$sulphates)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.2200  0.4100  0.4700  0.4898  0.5500  1.0800

Sulphates range from 0.22 to 1.08. In the plot, the summed quality rises as sulphates increase up to the median value of 0.47 and falls as sulphates increase beyond it.

ggplot(aes(x=alcohol,y=quality),data=pf)+
      geom_histogram(stat="identity",binwidth = 1) 
## Warning: Ignoring unknown parameters: binwidth, bins, pad

summary(pf$alcohol)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.00    9.50   10.40   10.51   11.40   14.20

Alcohol ranges from 8.0 to 14.2. In the plot, the summed quality rises as alcohol increases up to the median value of 10.40 and falls as alcohol increases beyond it.

summary(pf$quality)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.000   5.000   6.000   5.878   6.000   9.000
names(pf)
##  [1] "X"                    "fixed.acidity"        "volatile.acidity"    
##  [4] "citric.acid"          "residual.sugar"       "chlorides"           
##  [7] "free.sulfur.dioxide"  "total.sulfur.dioxide" "density"             
## [10] "pH"                   "sulphates"            "alcohol"             
## [13] "quality"
pf$alcohol_percentage<-cut(pf$alcohol,c(8,10,12,14,16))
head(pf$alcohol_percentage)
## [1] (8,10]  (8,10]  (10,12] (8,10]  (8,10]  (10,12]
## Levels: (8,10] (10,12] (12,14] (14,16]
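One detail of `cut()` is worth noting: by default each interval is open on the left, so a wine whose alcohol equals the lowest break (8.0, the minimum in the summary above) is assigned `NA` instead of the first group. The sketch below uses a handful of illustrative alcohol values (not rows from the wine file) to show the default behaviour next to `include.lowest = TRUE`.

```r
# Illustrative alcohol values; 8.0 matches the lowest break exactly.
alcohol <- c(8.0, 8.8, 10.0, 10.1, 14.2)
breaks  <- c(8, 10, 12, 14, 16)

# Default: intervals are (lower, upper], so 8.0 falls outside the first bin.
default_groups <- cut(alcohol, breaks)

# include.lowest = TRUE closes the first interval on the left: [8,10].
closed_groups <- cut(alcohol, breaks, include.lowest = TRUE)

print(default_groups)
print(closed_groups)
```

Whether to close the first interval is a judgment call; the point is only that rows sitting exactly on the lowest break are otherwise dropped from any plot that groups by this factor.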

Univariate Analysis

What is the structure of your dataset?

Answer: This tidy data set contains 4,898 white wines with 11 variables quantifying the chemical properties of each wine. At least 3 wine experts rated the quality of each wine, providing a rating between 0 (very bad) and 10 (very excellent).

Quality (worst to best): 3, 4, 5, 6, 7, 8, 9. The observed scores are whole numbers, treated here as a numeric variable.

Other observations:

The average wine quality is 5.878. In the plots above, the summed quality rises as each ingredient increases up to its median value and falls as the ingredient increases beyond its median.

What is/are the main feature(s) of interest in your dataset?

The main features in the data set are alcohol and quality. I'd like to determine which ingredients are best for predicting the quality of a wine. I suspect alcohol and some combination of the other variables can be used to build a predictive model of wine quality.

What other features in the dataset do you think will help support your investigation into your feature(s) of interest?

fixed.acidity, volatile.acidity, citric.acid, residual.sugar, chlorides, free.sulfur.dioxide, total.sulfur.dioxide, density, pH, sulphates and alcohol likely contribute to the quality of a white wine. After researching information on wine quality, I think alcohol contributes the most.

Did you create any new variables from existing variables in the dataset?

I created a variable for the alcohol percentage group of each wine from the alcohol variable. This arose in the bivariate section of my analysis when I explored how the quality of a wine varied with its alcohol percentage. The grouping was calculated by dividing the alcohol percentage into four groups.

Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

I examined the distribution of alcohol percentage and its correlation with quality, which is moderate and positive (r ≈ 0.44). I also examined the distribution of pH and its correlation with quality, which is positive but weak (r ≈ 0.10).

head(pf)
##   X fixed.acidity volatile.acidity citric.acid residual.sugar chlorides
## 1 1           7.0             0.27        0.36           20.7     0.045
## 2 2           6.3             0.30        0.34            1.6     0.049
## 3 3           8.1             0.28        0.40            6.9     0.050
## 4 4           7.2             0.23        0.32            8.5     0.058
## 5 5           7.2             0.23        0.32            8.5     0.058
## 6 6           8.1             0.28        0.40            6.9     0.050
##   free.sulfur.dioxide total.sulfur.dioxide density   pH sulphates alcohol
## 1                  45                  170  1.0010 3.00      0.45     8.8
## 2                  14                  132  0.9940 3.30      0.49     9.5
## 3                  30                   97  0.9951 3.26      0.44    10.1
## 4                  47                  186  0.9956 3.19      0.40     9.9
## 5                  47                  186  0.9956 3.19      0.40     9.9
## 6                  30                   97  0.9951 3.26      0.44    10.1
##   quality alcohol_percentage
## 1       6             (8,10]
## 2       6             (8,10]
## 3       6            (10,12]
## 4       6             (8,10]
## 5       6             (8,10]
## 6       6            (10,12]

The chemical properties of a white wine tend to correlate with each other and with other variables. Quality correlates with alcohol and, to a lesser degree, with other variables.

set.seed(1000)
ggpairs(pf,
        lower = list(continuous=wrap("points",shape=I('.'))),
          upper = list(combo=wrap("box",outlier.shape=I('.'))))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

cor.test(pf$alcohol,pf$quality)
## 
##  Pearson's product-moment correlation
## 
## data:  pf$alcohol and pf$quality
## t = 33.858, df = 4896, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.4126015 0.4579941
## sample estimates:
##       cor 
## 0.4355747
cor.test(pf$alcohol,pf$pH)
## 
##  Pearson's product-moment correlation
## 
## data:  pf$alcohol and pf$pH
## t = 8.5601, df = 4896, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.09374446 0.14893205
## sample estimates:
##       cor 
## 0.1214321
cor.test(pf$pH,pf$quality)
## 
##  Pearson's product-moment correlation
## 
## data:  pf$pH and pf$quality
## t = 6.9917, df = 4896, p-value = 3.081e-12
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.07162022 0.12707983
## sample estimates:
##        cor 
## 0.09942725

Create a new function to transform the X variable

cuberoot_trans = function() trans_new('cuberoot', transform = function(x) x^(1/3),inverse = function(x) x^3)

cuberoot function
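The two functions passed to `trans_new()` must be inverses of each other so that ggplot2 can map between data values and axis positions. A quick base R check of that round trip is sketched below; the helper names `cuberoot` and `cube` are introduced here only for illustration.

```r
# Forward and inverse functions, mirroring the ones given to trans_new().
cuberoot <- function(x) x^(1/3)
cube     <- function(x) x^3

x    <- c(0, 1, 8, 27, 100)
y    <- cuberoot(x)   # transformed values
back <- cube(y)       # should recover x up to floating-point error

print(y)
print(isTRUE(all.equal(back, x)))

# Caveat: in R, a negative base raised to a fractional power yields NaN,
# so this transform is only suitable for non-negative data.
neg_result <- suppressWarnings((-8)^(1/3))
print(neg_result)
```

All variables plotted with this transform here are non-negative, so the caveat does not affect the plots in this notebook.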

From a subset of the data, fixed acidity and total sulphur dioxide do not seem to have strong correlations with quality. Alcohol is moderately correlated with quality (r ≈ 0.44), while pH is only weakly correlated with it (r ≈ 0.10). I want to look closer at scatter plots involving quality and some other variables such as fixed acidity, alcohol and pH.

ggplot(aes(fixed.acidity,quality),data=pf)+
       geom_point()+
       scale_x_continuous(trans = cuberoot_trans(),limits = c(6,14),
                          breaks = c(6,8,10,12,14))+
       scale_y_continuous(trans = cuberoot_trans(),limits = c(2,10),
                          breaks = c(2,4,6,8,10))+
     ggtitle('Quality (cube root) by cube root of fixed acidity')
## Warning: Removed 575 rows containing missing values (geom_point).

ggplot(aes(free.sulfur.dioxide,quality),data=pf)+
       geom_point()+
       scale_x_continuous(trans = cuberoot_trans(),limits = c(0,100),
                          breaks = c(0,20,40,60,100))+
       scale_y_continuous(trans = cuberoot_trans(),limits = c(2,10),
                          breaks = c(2,4,6,8,10))+
     ggtitle('Quality (cube root) by cube root of free sulphur dioxide')
## Warning: Removed 17 rows containing missing values (geom_point).

As free sulphur dioxide increases, the variance in quality increases. Quality appears to rise up to about the median free sulphur dioxide and to decline after that.

cor.test(pf$free.sulfur.dioxide,pf$quality)
## 
##  Pearson's product-moment correlation
## 
## data:  pf$free.sulfur.dioxide and pf$quality
## t = 0.57085, df = 4896, p-value = 0.5681
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.01985292  0.03615626
## sample estimates:
##         cor 
## 0.008158067
ggplot(aes(x=alcohol_percentage,y=quality),data=pf)+
      geom_boxplot() 

summary(pf$alcohol)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.00    9.50   10.40   10.51   11.40   14.20

The median alcohol content is 10.40. The boxplots look unusual, since I would expect wines with an ideal alcohol percentage to have a higher quality than the other groups. There are many outliers. The variation in quality tends to increase with alcohol percentage up to about the median and then decreases as alcohol percentage rises above the median.

summary(pf$pH)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.720   3.090   3.180   3.188   3.280   3.820
pf$pH_group<-cut(pf$pH,c(2.720,3.02,3.32,3.62,3.82))
head(pf$pH_group)
## [1] (2.72,3.02] (3.02,3.32] (3.02,3.32] (3.02,3.32] (3.02,3.32] (3.02,3.32]
## Levels: (2.72,3.02] (3.02,3.32] (3.32,3.62] (3.62,3.82]
ggplot(aes(x=pH_group,y=quality),data = pf)+
      geom_boxplot()

summary(pf$pH)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.720   3.090   3.180   3.188   3.280   3.820

The median pH is 3.18. The boxplots look unusual, since I would expect wines with an ideal pH to have a higher quality than the other groups. There are many outliers. The variation in quality tends to increase with pH up to about the median and then decreases as pH rises above the median.

Bivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

Quality correlates moderately with alcohol percentage (r ≈ 0.44) and weakly with pH (r ≈ 0.10).

As alcohol percentage increases, the variance in quality increases up to the median value. In the plot of quality vs. alcohol, quality increases up to the median alcohol value and starts decreasing after that. The relationship between alcohol and quality is not regular.

Based on the R^2 value (0.4356^2 ≈ 0.19), alcohol explains about 19 percent of the variance in quality. Other ingredients of interest can be incorporated into the model to explain more of the variance in quality.
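For a model with a single predictor, R^2 equals the square of the Pearson correlation, so the share of variance explained can be checked directly from the `cor.test()` estimate reported earlier (0.4355747). The sketch below does that arithmetic and then confirms the identity on simulated data; the vectors `x` and `y` are made up for illustration and are not wine measurements.

```r
# Pearson correlation between alcohol and quality, as reported by cor.test().
r <- 0.4355747
r_squared <- r^2
print(round(r_squared, 3))

# Confirm on simulated data that R^2 from a one-predictor lm()
# matches the squared correlation.
set.seed(1)
x <- rnorm(100)
y <- 0.5 * x + rnorm(100)
fit <- lm(y ~ x)
identity_holds <- isTRUE(all.equal(summary(fit)$r.squared, cor(x, y)^2))
print(identity_holds)
```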

Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

Alcohol percentage and pH are weakly positively correlated (r ≈ 0.12): the higher the alcohol percentage, the slightly greater the pH tends to be.

What was the strongest relationship you found?

The strongest relationship is between quality and alcohol, which is positive and moderate (r ≈ 0.44). pH is positively but only weakly correlated with quality (r ≈ 0.10). The variables fixed.acidity and free.sulfur.dioxide relate to quality less strongly than alcohol; for free.sulfur.dioxide the correlation is essentially zero (r ≈ 0.008, p = 0.57). Alcohol and pH are only weakly correlated with each other (r ≈ 0.12), so they do not measure the same property and both could be used in a model to predict wine quality.

Multivariate Plots Section

ggplot(aes(x=quality,y=free.sulfur.dioxide),
       data=subset(pf,!is.na(alcohol_percentage)))+
      geom_line(aes(color=alcohol_percentage),stat='summary',fun.y=median)

We can see that median free sulphur dioxide falls as quality rises to 4. After that, median free sulphur dioxide rises with quality, and then falls again at the highest quality scores.
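`stat='summary'` with `fun.y=median` draws, for each quality score and colour group, the median of the y variable. The same numbers can be tabulated directly, which makes the line plot easier to read. The sketch below uses `aggregate()` on a small made-up data frame; the `toy` rows are illustrative and only the column names follow the notebook.

```r
# Made-up rows for illustration; only the column names follow the notebook.
toy <- data.frame(
  quality             = c(3, 3, 4, 4, 5, 5, 5, 6),
  free.sulfur.dioxide = c(40, 50, 20, 30, 35, 45, 25, 60),
  alcohol_percentage  = c("(8,10]", "(8,10]", "(8,10]", "(10,12]",
                          "(8,10]", "(10,12]", "(10,12]", "(10,12]")
)

# Median free sulfur dioxide for every quality / alcohol-group combination:
# one row per point on the summary line plot.
med <- aggregate(free.sulfur.dioxide ~ quality + alcohol_percentage,
                 data = toy, FUN = median)
print(med)
```

Each row of `med` corresponds to one vertex of a coloured line, so unexpected jumps in the plot can be traced back to groups that contain very few wines.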

ggplot(aes(x=quality,y=free.sulfur.dioxide),
       data=subset(pf,!is.na(pH_group)))+
      geom_line(aes(color=pH_group),stat='summary',fun.y=median)

Grouped by pH, median free sulphur dioxide again falls as quality rises to 4. After that, median free sulphur dioxide rises with quality, and then falls again at the highest quality scores.

ggplot(aes(x=quality,y=volatile.acidity),
       data=subset(pf,!is.na(pH_group)))+
      geom_line(aes(color=pH_group),stat='summary',fun.y=median)

names(pf)
##  [1] "X"                    "fixed.acidity"        "volatile.acidity"    
##  [4] "citric.acid"          "residual.sugar"       "chlorides"           
##  [7] "free.sulfur.dioxide"  "total.sulfur.dioxide" "density"             
## [10] "pH"                   "sulphates"            "alcohol"             
## [13] "quality"              "alcohol_percentage"   "pH_group"

Except for pH group (3.02,3.32], median volatile acidity rises as quality increases to 4. After that, median volatile acidity falls as quality increases, and then rises again at the highest quality scores.

Quality vs. free sulphur dioxide and alcohol

library(RColorBrewer)

ggplot(aes(free.sulfur.dioxide,quality,color=alcohol_percentage),data=pf)+
       geom_point(alpha=0.5,size=1,position = 'jitter')+
       scale_color_brewer(type='div',
            guide=guide_legend(title='Alcohol Percentage',reverse = T,
                               override.aes = list(alpha=1,size=2)))+
       scale_x_continuous(trans = cuberoot_trans(),limits = c(0,100),
                          breaks = c(0,20,40,60,100))+
       scale_y_continuous(trans = cuberoot_trans(),limits = c(2,10),
                          breaks = c(2,4,6,8,10))+
     ggtitle('Quality (cube root) by cube root of free sulphur dioxide and alcohol')
## Warning: Removed 19 rows containing missing values (geom_point).

The plot suggests that a linear model could be constructed to predict wine quality, using the cube root of quality as the outcome variable and the cube root of free sulphur dioxide as the predictor variable. We can also see from the plot that wine quality increases up to the median alcohol percentage and decreases as alcohol percentage rises above the median.

ggplot(aes(free.sulfur.dioxide,quality,color=pH_group),data=pf)+
       geom_point(alpha=0.5,size=1,position = 'jitter')+
       scale_color_brewer(type='div',
            guide=guide_legend(title='pH group',reverse = T,
                               override.aes = list(alpha=1,size=2)))+
       scale_x_continuous(trans = cuberoot_trans(),limits = c(0,100),
                          breaks = c(0,20,40,60,100))+
       scale_y_continuous(trans = cuberoot_trans(),limits = c(2,10),
                          breaks = c(2,4,6,8,10))+
     ggtitle('Quality (cube root) by cube root of free sulphur dioxide and pH group')
## Warning: Removed 18 rows containing missing values (geom_point).

ggplot(aes(volatile.acidity,quality,color=pH_group),data=pf)+
       geom_point(alpha=0.5,size=1,position = 'jitter')+
       scale_color_brewer(type='div',
            guide=guide_legend(title='pH group',reverse = T,
                               override.aes = list(alpha=1,size=2)))+
       scale_x_continuous(trans = cuberoot_trans(),limits = c(0,2),
                          breaks = c(0.5,1,1.5,2))+
       scale_y_continuous(trans = cuberoot_trans(),limits = c(2,10),
                          breaks = c(2,4,6,8,10))+
     ggtitle('Quality (cube root) by cube root of volatile acidity and pH group')
## Warning: Removed 1 rows containing missing values (geom_point).

ggplot(aes(x=pH_group,y=quality),data = pf)+
      geom_boxplot()

We can see that wine quality increases with alcohol up to the median alcohol value. Beyond the median, quality decreases as alcohol increases.

ggplot(aes(x=alcohol_percentage,y=quality),data = pf)+
      geom_boxplot()

Multivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

Ideally, wines also have an average alcohol percentage and pH. The variance of wine quality increases up to the median alcohol percentage and starts decreasing after that.

The last two plots from the Multivariate section suggest that I can build a linear model and use those variables in the model to predict wine quality.
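A minimal sketch of what such a model could look like in R, assuming `quality` is regressed on `alcohol` and `pH`. The data frame `wine_sim` and the coefficients used to generate it are simulated for illustration only; they are not results from the wine file.

```r
# Simulated stand-in for the wine data (hypothetical values, not real results).
set.seed(42)
n <- 500
wine_sim <- data.frame(
  alcohol = runif(n, min = 8, max = 14),
  pH      = runif(n, min = 2.7, max = 3.8)
)
# Generate quality from a known linear relationship plus noise.
wine_sim$quality <- 2 + 0.3 * wine_sim$alcohol + 0.5 * wine_sim$pH +
  rnorm(n, sd = 0.7)

# Fit a linear model with two predictors and extract its key summaries.
fit   <- lm(quality ~ alcohol + pH, data = wine_sim)
coefs <- coef(fit)
r2    <- summary(fit)$r.squared

print(round(coefs, 2))
print(round(r2, 2))
```

On the real data the same call would take `data = pf`, and the fitted coefficients and R^2 would then show how much of the variation in quality these variables account for.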

Were there any interesting or surprising interactions between features?

Wine quality both rises and falls with alcohol. Quality increases with alcohol percentage up to its median value. As alcohol percentage rises above the median, quality decreases.

Final Plots and Summary

Plot One

ggplot(aes(x=pH,y=quality),data=pf)+
      geom_histogram(stat="identity",binwidth = 1) 
## Warning: Ignoring unknown parameters: binwidth, bins, pad

Description One

We can see that the summed wine quality increases with pH up to the median pH value. Beyond the median, it decreases as pH increases.

Plot two

ggplot(aes(x=alcohol_percentage,y=quality),data=pf)+
      geom_boxplot() 

ggplot(aes(x=quality,y=volatile.acidity),
       data=subset(pf,!is.na(pH_group)))+
      geom_line(aes(color=pH_group),stat='summary',fun.y=median)

Description Two

We can see from the two graphs above that wine quality increases up to the median alcohol percentage and decreases as alcohol percentage rises above the median. Likewise, wine quality increases with pH up to the median pH and decreases as pH rises above the median.

Plot Three

ggplot(aes(free.sulfur.dioxide,quality,color=alcohol_percentage),data=pf)+
       geom_point(alpha=0.5,size=1,position = 'jitter')+
       scale_color_brewer(type='div',
            guide=guide_legend(title='Alcohol Percentage',reverse = T,
                               override.aes = list(alpha=1,size=2)))+
       scale_x_continuous(trans = cuberoot_trans(),limits = c(0,100),
                          breaks = c(0,20,40,60,100))+
       scale_y_continuous(trans = cuberoot_trans(),limits = c(2,10),
                          breaks = c(2,4,6,8,10))+
     ggtitle('Quality (cube root) by cube root of free sulphur dioxide and alcohol')
## Warning: Removed 19 rows containing missing values (geom_point).

Description Three

The plot suggests that a linear model could be constructed to predict wine quality, using the cube root of quality as the outcome variable and the cube root of free sulphur dioxide as the predictor variable. We can also see from the plot that wine quality increases up to the median alcohol percentage and decreases as alcohol percentage rises above the median.

Reflection

The white wine data set contains information on 4,898 white wines with 11 variables. I started by understanding the individual variables in the data set, and then I explored interesting questions and leads as I continued to make observations on plots. Eventually, I explored the quality of white wines across many variables and considered building a linear model to predict wine quality.

This tidy data set contains 4,898 white wines with 11 variables quantifying the chemical properties of each wine. At least 3 wine experts rated the quality of each wine, providing a rating between 0 (very bad) and 10 (very excellent). The main features in the data set are alcohol and quality. I'd like to determine which ingredients are best for predicting the quality of a wine. I suspect alcohol and some combination of the other variables can be used to build a predictive model of wine quality.

Quality (worst to best): 3, 4, 5, 6, 7, 8, 9. The observed scores are whole numbers, treated here as a numeric variable.


Some limitations of this analysis stem from the source of the data. The data set contains only 4,898 wines, which is not very large, and it is a sample rather than the full population, so predictions based on it may be wrong. To investigate further, I would like to gather much more data and train a model on it. I would also like to analyze which factors best explain wine quality and which combinations of ingredients customers like most.
